Joint-task Self-supervised Learning for Temporal Correspondence
This paper proposes to learn reliable dense correspondence from videos in a self-supervised manner. Our learning process integrates two highly related tasks: tracking large image regions and establishing fine-grained pixel-level associations between consecutive video frames. We exploit the synergy between both tasks through a shared inter-frame affinity matrix, which simultaneously models transitions between video frames at both the region and pixel levels. Region-level localization helps reduce ambiguities in fine-grained matching by narrowing down search regions, while fine-grained matching provides bottom-up features to facilitate region-level localization. Our method outperforms the state-of-the-art self-supervised methods on a variety of visual correspondence tasks, including video-object and part-segmentation propagation, keypoint tracking, and object tracking. Our self-supervised method even surpasses the fully supervised affinity feature representation obtained from a ResNet-18 pre-trained on ImageNet.
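As a rough illustration of what an inter-frame affinity matrix can look like, the sketch below builds a row-normalized affinity from per-pixel feature vectors of two frames and uses it to carry labels from one frame to the other. The softmax-over-dot-products construction, the `temperature` parameter, and the function names `affinity` and `propagate` are assumptions made for this example; the abstract does not specify the exact formulation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def affinity(feats_a, feats_b, temperature=1.0):
    """Assumed construction: row i holds how strongly pixel i of frame B
    matches every pixel of frame A (each row sums to 1)."""
    rows = []
    for fb in feats_b:
        scores = [sum(x * y for x, y in zip(fa, fb)) / temperature
                  for fa in feats_a]
        rows.append(softmax(scores))
    return rows

def propagate(aff, labels_a):
    """Carry per-pixel label vectors from frame A to frame B by taking
    an affinity-weighted average of frame A's labels."""
    n_channels = len(labels_a[0])
    return [[sum(w * la[c] for w, la in zip(row, labels_a))
             for c in range(n_channels)]
            for row in aff]

# Toy usage: two pixels whose features swap places between frames.
feats_a = [[1.0, 0.0], [0.0, 1.0]]
feats_b = [[0.0, 1.0], [1.0, 0.0]]
labels_a = [[1.0, 0.0], [0.0, 1.0]]  # one-hot labels in frame A
labels_b = propagate(affinity(feats_a, feats_b, temperature=0.1), labels_a)
```

In this toy case the propagated labels follow the swapped pixels. The same matrix could be applied to region coordinates instead of labels, which is one way a single affinity can serve both a coarse and a fine task.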
Reviews: Joint-task Self-supervised Learning for Temporal Correspondence
The work does not include original ideas. It is exclusively a collection of previous ideas combined in a rather classical way. Major remarks: Equation (6) makes the loss non-smooth and non-differentiable. The authors do not discuss how they handle this. I assume they use the typical approach of selecting the right 'case' in the forward step and then back-propagating through the fixed smooth function.
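The 'typical approach' the reviewer assumes can be sketched on a stand-in function. Equation (6) is not reproduced here, so the hinge loss below is only a hypothetical example of a piecewise loss, and the names `hinge_forward` and `hinge_backward` are made up for illustration: the forward step records which case applied, and the backward step differentiates that case alone.

```python
def hinge_forward(x, margin=1.0):
    """Evaluate max(0, margin - x) and report which piece was used."""
    if x < margin:
        return margin - x, "linear"
    return 0.0, "zero"

def hinge_backward(case):
    """Derivative with respect to x of the smooth piece chosen in the
    forward pass; the kink at x == margin falls into the 'zero' case."""
    return -1.0 if case == "linear" else 0.0

value, case = hinge_forward(0.5)
grad = hinge_backward(case)
```

At the kink itself the function has no unique derivative; fixing the case in the forward step simply commits to one valid subgradient.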
Reviews: Joint-task Self-supervised Learning for Temporal Correspondence
The paper presents a new approach to tracking and pixel-level correspondence using self-supervised learning in video. It goes in the direction of multi-task learning. The results are also solid. The reviewers initially gave scores of 5, 6, and 7; after the rebuttal, even the most skeptical reviewer was convinced to raise their rating.
Joint-task Self-supervised Learning for Temporal Correspondence
Li, Xueting; Liu, Sifei; De Mello, Shalini; Wang, Xiaolong; Kautz, Jan; Yang, Ming-Hsuan